
add python accelerator HAL for multi-vendor backends#86

Closed
theap06 wants to merge 24 commits into facebookresearch:main from theap06:feat/accelerator-hal-upstream-clean

Conversation

@theap06
Contributor

@theap06 theap06 commented Mar 3, 2026

Introduce a hardware-agnostic accelerator abstraction layer with normalized metrics, backend management, and runtime probing. Includes a functional NVML backend plus ROCm/LevelZero/TPU/Neuron scaffolds and dedicated HAL tests.

Adds a Python-first hardware-agnostic accelerator HAL at gcm/monitoring/accelerator.
Decouples telemetry collection from NVML-only assumptions via a common backend interface and normalized metrics.
Implements a functional NVMLBackend; adds scaffold backends for ROCm, LevelZero, TPU, and Neuron.

Implements Feature Request #74
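
As context for reviewers, the common backend interface described above can be sketched roughly as below. The names AcceleratorBackend and DeviceHandle mirror those discussed in the review, but the exact fields and method signatures here are illustrative assumptions, not the PR's actual definitions:

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class DeviceHandle:
    """Illustrative device handle; the PR's real fields may differ."""
    index: int
    model: str


@dataclass
class DeviceMetrics:
    """Normalized metrics shared across vendors (a subset, for illustration)."""
    gpu_util: float
    mem_util: float
    temperature: float
    power_draw: float


class AcceleratorBackend(Protocol):
    """Vendor-neutral contract a backend (NVML, ROCm, ...) would satisfy."""
    def probe(self) -> bool: ...
    def enumerate_devices(self) -> list[DeviceHandle]: ...
    def read_metrics(self, device: DeviceHandle) -> DeviceMetrics: ...
    def close(self) -> None: ...


class FakeBackend:
    """Minimal in-memory backend used here only to exercise the interface."""
    def probe(self) -> bool:
        return True

    def enumerate_devices(self) -> list[DeviceHandle]:
        return [DeviceHandle(index=0, model="Fake GPU")]

    def read_metrics(self, device: DeviceHandle) -> DeviceMetrics:
        return DeviceMetrics(gpu_util=0.0, mem_util=0.0,
                             temperature=45.0, power_draw=25.0)

    def close(self) -> None:
        pass
```

Because the interface is a Protocol, FakeBackend satisfies it without inheriting from it; any vendor backend that implements the four methods is interchangeable.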

Test Plan:

Ran HAL tests:
12 passed

pytest -q gcm/tests/test_accelerator_hal.py 

gcm --backend=nvml nvml_monitor --sink=stdout --once --log-folder=/tmp/gcm-log

[{"gpu_id": 0, "hostname": "129-159-36-52", "mem_util": 0, "gpu_util": 0, "mem_used_percent": 0, "temperature": 45, "power_draw": 25, "power_used_percent": 5, "retired_pages_count_single_bit": 0, "retired_pages_count_double_bit": 0, "job_id": null, "job_user": null, ...}]
[{"gpu_index": 0, "mem_util": 0, "mem_used_percent": 0, "gpu_util": 0, "temperature": 45, "power_draw": 25, "power_used_percent": 5, "retired_pages_count_single_bit": 0, "retired_pages_count_double_bit": 0}, {"max_gpu_util": 0.0, "min_gpu_util": 0.0, "avg_gpu_util": 0.0, "ram_util": 0.42}]

echo $? 
Output: 0

health_checks --backend=nvml check-nvidia-smi fair_cluster nagios --sink=stdout -c gpu_num --gpu_num=0

[{"node": "129-159-36-52", "cluster": "fair_cluster", "derived_cluster": "fair_cluster", "health_check": "nvidia smi", "type": "nagios", "result": 0, "_msg": "Number of GPUs present is the same as expected, 0", "job_id": 0, "1772265789.8423011": <float>, "1772265789.8663087": <float>}]

echo $? 
Output: 0

@github-actions

github-actions bot commented Mar 3, 2026

CI Commands

The following CI workflows run automatically on every push and pull request:

| Workflow | What it runs |
| --- | --- |
| GPU Cluster Monitoring Python CI | lint, tests, typecheck, format, deb build, pyoxidizer builds |
| Go packages CI | shelper tests, format, lint |

The following commands can be used by maintainers to trigger additional tests that require access to secrets:

| Command | Description | Requires approval? |
| --- | --- | --- |
| /metaci tests | Runs Meta internal integration tests (pytest) | Yes — a maintainer must trigger the command and approve the deployment request |
| /metaci integration tests | Same as above (alias) | Yes |

Note: Only repository maintainers (OWNER association) can trigger /metaci commands. After commenting the command, a maintainer must also navigate to the Actions tab and approve the deployment to the graph-api-access environment before the jobs will run. See the approval guidelines for what to approve or reject.

@theap06 theap06 marked this pull request as draft March 3, 2026 06:35
@theap06 theap06 marked this pull request as ready for review March 3, 2026 07:05
@theap06
Contributor Author

theap06 commented Mar 4, 2026

Could I get some feedback? This tackles my feature request (#74) on supporting other hardware.

Member

@luccabb luccabb left a comment


adds scaffold backends for ROCm, LevelZero, TPU, and Neuron

let's just keep NVML since it's the only backend supported in this PR

Comment thread gcm/monitoring/accelerator/backends/levelzero.py Outdated
@theap06 theap06 requested a review from luccabb March 4, 2026 08:33
Member

@luccabb luccabb left a comment


add some integration to existing paths, then run the cmds before and after these changes to prove it's all working as expected:

gcm --backend=nvml --sink=stdout ...
healthchecks ... --backend=nvml --sink=stdout ...

add the above to the test plan

@theap06 theap06 requested a review from A-Kokolis as a code owner March 4, 2026 09:00
@theap06 theap06 requested a review from luccabb March 4, 2026 09:01
@luccabb
Member

luccabb commented Mar 4, 2026

gcm --backend=nvml --sink=stdout ...
healthchecks ... --backend=nvml --sink=stdout ...

this should be a full run of the cmd with the corresponding output for each

Comment thread gcm/tests/data/sample-sacct-fake-line-break-expected.txt
@theap06 theap06 requested a review from luccabb March 4, 2026 19:12
@theap06 theap06 force-pushed the feat/accelerator-hal-upstream-clean branch from 44b06c0 to 49fa293 Compare March 4, 2026 19:25
Member

@luccabb luccabb left a comment


gcm --backend=nvml --sink=stdout ...
healthchecks ... --backend=nvml --sink=stdout ...

this should be a full run of the cmd with the corresponding output for each

@theap06 theap06 requested a review from luccabb March 4, 2026 22:26
@theap06
Contributor Author

theap06 commented Mar 8, 2026

@luccabb the hardware backend works and the functionality works on my end with a GPU instance. Let me know if you have any other feedback.

Member

@luccabb luccabb left a comment


@theap06 I meant you should add the real CLI call and outputs from it in the test plan from inside your env.

similar to what you did for #61

@theap06
Contributor Author

theap06 commented Mar 9, 2026

@theap06 I meant you should add the real CLI call and outputs from it in the test plan from inside your env.

similar to what you did for #61

Done.

@theap06 theap06 requested a review from luccabb March 9, 2026 02:10
@theap06
Contributor Author

theap06 commented Mar 10, 2026

@luccabb could I get some feedback? I added the testing for the HAL in the PR message.

@luccabb
Member

luccabb commented Mar 11, 2026

@claude can you review this pr?

@claude

claude bot commented Mar 11, 2026

Claude Code is working…

I'll analyze this and get back to you.


@luccabb
Member

luccabb commented Mar 11, 2026

@claude can you review this pr?

@claude

claude bot commented Mar 11, 2026

Claude finished @luccabb's task in 1m 52s


PR Review: add python accelerator HAL for multi-vendor backends

  • Gather context and understand request
  • Read CLAUDE.md for repo guidelines
  • Read changed files
  • Provide comprehensive review

Overall, this is a clean, well-structured HAL abstraction. The module layout, error hierarchy, Protocol-based interface, and test coverage are all solid. A few issues need to be addressed before merging.


Critical Issues

1. --backend option is accepted but never used

In both gcm/monitoring/cli/gcm.py:43 and gcm/health_checks/cli/health_checks.py:39, the backend parameter is immediately discarded:

def main(detach: bool, backend: str) -> None:
    _ = backend   # ← never forwarded anywhere

The option appears in --help and is validated by Click, but it does nothing. The new AcceleratorManager/NVMLBackend is never instantiated from the CLI path. The nvml_monitor subcommand still goes directly through the old DeviceTelemetryClient. Until this wiring is added, the HAL is not actually used at runtime.
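
For illustration only, the missing wiring might look like the sketch below. BACKEND_REGISTRY and NVMLBackendStub are hypothetical names standing in for the PR's AcceleratorManager/NVMLBackend machinery, whose exact API is not shown here:

```python
from typing import Callable, Dict


class NVMLBackendStub:
    """Stand-in for the PR's NVMLBackend (gcm/monitoring/accelerator)."""
    name = "nvml"


# Hypothetical name-to-factory registry; the PR presumably has its own.
BACKEND_REGISTRY: Dict[str, Callable[[], object]] = {"nvml": NVMLBackendStub}


def main(detach: bool, backend: str) -> object:
    # Previously: `_ = backend` silently discarded the option.
    # Resolve it instead, so the monitoring path actually uses the HAL.
    try:
        factory = BACKEND_REGISTRY[backend]
    except KeyError:
        raise ValueError(f"unsupported backend: {backend!r}")
    return factory()
```

An unknown name now fails loudly instead of being accepted and ignored, which is the behavior the review is asking for.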


2. get_vbios_version() used to populate the model field — semantically wrong

In gcm/monitoring/accelerator/backends/nvml.py:92:

model = client.get_device_by_index(index).get_vbios_version()

get_vbios_version() returns a firmware version string (e.g. "90.00.38.00.09"), not the GPU model name (e.g. "A100 80GB"). The DeviceHandle.model field implies a human-readable hardware model. This should use get_name() or similar if available on the client.


Minor Issues

3. NVMLBackend.close() does not close the underlying client

In gcm/monitoring/accelerator/backends/nvml.py:174-176:

def close(self) -> None:
    self._client = None   # ← just drops the reference; no shutdown call
    return None

If DeviceTelemetryClient holds resources (e.g. an NVML context), this leaks them. close() should call self._client.close() (or equivalent) before clearing the reference.
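
A hedged sketch of the suggested fix; whether DeviceTelemetryClient actually exposes a close() method (versus some other shutdown call) is an assumption here:

```python
class ClientStub:
    """Stand-in for DeviceTelemetryClient; the real shutdown name may differ."""
    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True  # e.g. where nvmlShutdown() would be invoked


class NVMLBackendSketch:
    def __init__(self, client: ClientStub) -> None:
        self._client = client

    def close(self) -> None:
        # Shut the client down BEFORE dropping the reference, so the
        # underlying NVML resources are released rather than leaked.
        # Safe to call more than once.
        if self._client is not None:
            self._client.close()
            self._client = None
```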

4. BackendOperationError stores cause redundantly

In gcm/monitoring/accelerator/errors.py:20-27, cause: Exception is stored as a dataclass field and set as the chained exception via raise ... from e at the call sites. This means the exception appears twice. The __str__ method also manually formats it. Consider dropping the explicit cause field and relying solely on Python's built-in exception chaining (__cause__).
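
The built-in chaining this suggestion relies on can be shown in isolation; the snippet below is standard Python behavior, not the PR's actual errors.py:

```python
class BackendOperationError(Exception):
    """No explicit `cause` field: `raise ... from e` records it as __cause__."""


def read_metrics() -> None:
    try:
        raise OSError("NVML library not loaded")
    except OSError as e:
        # Chaining stores the original exception exactly once, on __cause__,
        # and tracebacks render the "The above exception was the direct
        # cause..." banner automatically -- no manual __str__ formatting.
        raise BackendOperationError("read_metrics failed") from e


try:
    read_metrics()
except BackendOperationError as err:
    assert isinstance(err.__cause__, OSError)
```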

5. NVMLBackend subclasses AcceleratorBackend (a Protocol)

In gcm/monitoring/accelerator/backends/nvml.py:51:

@dataclass
class NVMLBackend(AcceleratorBackend):

AcceleratorBackend is defined as a Protocol. You don't need to explicitly inherit from a Protocol to satisfy it — you just need to implement the methods. Subclassing a Protocol creates a coupling that isn't needed and is non-idiomatic. NVMLBackend should just be a plain class (or dataclass) that happens to implement all the AcceleratorBackend methods.
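
The structural-typing point can be demonstrated with a toy Protocol (names here are illustrative, not from the PR):

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Probeable(Protocol):
    def probe(self) -> bool: ...


class PlainBackend:
    """Does NOT inherit from Probeable, yet satisfies it structurally."""
    def probe(self) -> bool:
        return True


backend = PlainBackend()
# A runtime_checkable Protocol supports isinstance() via duck typing,
# so explicit inheritance buys nothing and only adds coupling.
assert isinstance(backend, Probeable)
```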


Observations / Non-blocking

  • MetricRequest.include_process_info is ignored in read_metrics: The _request parameter in NVMLBackend.read_metrics is never read. If process info collection is planned, a TODO comment would clarify intent.
  • probe() is called before enumerate_devices() in tests but not enforced: Nothing stops a caller from calling enumerate_devices() on an unprobed backend (the client will be lazily initialized, which may raise). A guard or documented contract would help.
  • Test coverage is good — the 12 HAL unit tests cover normal operation, partial failures, missing library, fallback paths, manager lifecycle, and registry. The new integration tests in test_gcm.py gracefully handle no-GPU environments.
  • README is clear and well-written — the lifecycle section and backend authoring guide are helpful.

Summary

The HAL design is solid and the tests are thorough. The main issue is that the --backend flag is not wired to the HAL — the new code is tested in isolation but not yet integrated into the runtime monitoring path, which undermines the core goal of the PR. The model field bug and close() resource leak should also be fixed. Happy to see this land once those are addressed.

Comment thread .github/workflows/gcm_python.yml Outdated
Comment thread gcm/exporters/telemetry.py Outdated
@theap06 theap06 requested a review from luccabb March 18, 2026 04:08
@theap06
Contributor Author

theap06 commented Mar 18, 2026

@claude review my pr

Comment thread .github/workflows/gcm_python.yml
Comment thread .github/workflows/gcm_python.yml
@luccabb
Member

luccabb commented Mar 18, 2026

@claude

@claude

claude bot commented Mar 18, 2026

Claude finished @luccabb's task in 1m 8s


PR Re-Review: add python accelerator HAL for multi-vendor backends

  • Gather context
  • Read changed files and CI diff
  • Provide updated review

One Issue Blocking Merge

.github/workflows/gcm_python.yml deletes the Rust installation step that exists on main

The PR removes these 5 lines from main (around line 67):

      - name: Install Rust
        run: |
          curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
          cp -r $HOME/.cargo/bin/* $HOME/.cache/venv-ci/bin/

This step was added to main independently to fix a build dependency that requires Rust. It has nothing to do with this PR. Removing it here means merging this PR would silently break the Rust-dependent build on main.

The author's explanation ("reverted it since I removed the related code from this PR") suggests they added this step in their branch and are now reverting their own addition — but this step was already on main before this PR branched off. The diff is subtractive relative to main, not just undoing a change introduced in this branch.

Fix: Restore those 5 lines so the workflow file is identical to main. This PR should not touch .github/workflows/gcm_python.yml at all.


Everything else from all prior reviews remains fixed. This is the only remaining change needed before merge.

@theap06 theap06 requested a review from luccabb March 18, 2026 16:50
@theap06
Contributor Author

theap06 commented Mar 18, 2026

@luccabb added back the Rust dep

Comment thread website/docs/GCM_Monitoring/accelerator_hal.md
Comment thread gcm/accelerator/__init__.py
Comment thread dev-requirements.txt Outdated
Comment thread gcm/monitoring/accelerator/backends/nvml.py Outdated
Comment thread gcm/accelerator/backends/nvml.py Outdated
@theap06 theap06 force-pushed the feat/accelerator-hal-upstream-clean branch from 8694aef to 5e0fbb9 Compare March 19, 2026 06:21
@theap06 theap06 requested a review from luccabb March 19, 2026 06:22
Member

@luccabb luccabb left a comment


you want this PR to be in a state where (1) ALL existing NVML paths from monitoring and health checks go through the new accelerator interface and behavior is the same as before the PR. After (1) is accomplished, add a test plan that exercises every code path that was updated, with before and after sections to prove that the functionality is the same.

Comment thread gcm/accelerator/metrics.py Outdated
Comment thread gcm/accelerator/backends/nvml.py Outdated
Comment thread gcm/accelerator/backends/nvml.py Outdated
Comment thread gcm/accelerator/backends/nvml.py Outdated
@theap06 theap06 force-pushed the feat/accelerator-hal-upstream-clean branch from 5e0fbb9 to d694a52 Compare March 20, 2026 05:38
@theap06 theap06 requested a review from luccabb March 20, 2026 05:39
Copy link
Copy Markdown
Member

@luccabb luccabb left a comment


please address #86 (review)

@theap06 theap06 force-pushed the feat/accelerator-hal-upstream-clean branch from d694a52 to 3989d0d Compare March 20, 2026 21:35
@theap06 theap06 force-pushed the feat/accelerator-hal-upstream-clean branch from 3989d0d to ccc9129 Compare March 20, 2026 21:37
@theap06
Contributor Author

theap06 commented Mar 20, 2026

please address #86 (review)

addressed

@theap06 theap06 requested a review from luccabb March 20, 2026 23:06
@theap06
Contributor Author

theap06 commented Mar 25, 2026

@luccabb could I get some feedback on this PR?

@theap06
Contributor Author

theap06 commented Apr 1, 2026

@luccabb it should be ready to merge

@theap06
Contributor Author

theap06 commented Apr 5, 2026

@luccabb will close this pr.

@theap06 theap06 closed this Apr 5, 2026